fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2) by colombod · Pull Request #21 · microsoft/amplifier-context-intelligence

colombod · 2026-06-19T13:43:04Z

Summary

Cross-session graph edges were silently dropped when an endpoint node was not yet committed at flush time. This makes the chunked-flush edge writer self-healing and unifies node identity on the universal :Node label so the fix holds across the whole ingest pipeline. Bumps the server to v4.0.2.

Root cause

_edge_merge_cypher built every cross-session edge with two MATCH clauses:

UNWIND $rows AS row
MATCH (src:Node {node_id: row.src_id, workspace: $workspace})
MATCH (dst:Node {node_id: row.dst_id, workspace: $workspace})
MERGE (src)-[r:TYPE]->(dst)

The two MATCH clauses are an inner join: if either endpoint was not yet committed, the row produced no bindings, the MERGE never ran, and the edge was dropped with no error, no log, no warning. This affects every cross-layer/cross-session edge the pipeline writes (SOURCED_FROM ×27, plus HAS_PART, TRIGGERED, HAS_STEP, and ~20 other edge types), all of which legitimately reference endpoints that may not exist at the writing flush.

Fix

_edge_merge_cypher: MATCH → MERGE both endpoints, then MERGE the relationship. An absent endpoint becomes a :Node placeholder rather than a dropped edge.
Session node writer re-keyed: MERGE (n:Session {…}) → MERGE (n:Node {…}) SET n:Session. Identity keys on the universal :Node label so a placeholder converges with the later typed write instead of forking into two nodes; the :Session label is still applied.
Schema (ensure_neo4j_schema): add a :Node(node_id, workspace) uniqueness constraint as the atomicity guard for concurrent MERGEs (the role the :Session constraint played for the Session MERGE); deduplicate (node_id, workspace) globally; run the universal-:Node backfill (previously unreachable behind an early return); drop the legacy plain idx_node_universal index (a uniqueness constraint carries its own backing index and cannot coexist with a standalone index on the same key); log backfill completion.
Sync the mocked schema/MERGE unit tests to the :Node-keyed identity.

Migration

The universal :Node label was introduced in #19, but its backfill shipped unreachable, so the graph holds nodes written before #19 that lack the label. ensure_neo4j_schema migrates idempotently — global dedup → backfill → verify → drop legacy index → :Node constraint — before any write. On the large production graph, run it as a verified two-phase deploy. Full runbook, verification queries, and rollback: docs/node-identity-migration.md.

Evidence

tests/neo4j (real Neo4j): 69 passed — includes a RED→GREEN proof of the silent drop (tests/neo4j/test_silent_edge_drop.py) and a dirty-graph migration test (tests/neo4j/test_node_identity_migration.py).
Non-Neo4j unit suite: 1373 passed, 2 skipped.
ruff clean on all changed files.

Reviewer checklist

ensure_neo4j_schema order: dedup → backfill → verify → drop legacy index → :Node constraint, before any write (_flush_body awaits _ensure_schema first).
Session writer MERGEs identity on :Node and SETs :Session.
Edge writer MERGEs both endpoints.
Concurrent-MERGE atomicity comes from the :Node uniqueness constraint.

…entity (v4.0.2) The chunked-flush edge writer built every cross-session edge with two MATCH clauses, silently dropping edges when endpoints were not yet committed. Fix: MERGE both endpoints to create :Node placeholders, re-key node identity on the universal :Node label so placeholders converge with later typed writes, add a :Node(node_id, workspace) uniqueness constraint as the atomicity guard for concurrent MERGEs, make the previously-dead universal-:Node backfill reachable, and drop the legacy plain index. The :Node label was introduced in PR #19 but its backfill shipped unreachable, leaving pre-#19 nodes untagged. The migration at startup (dedup → backfill → verify → drop legacy index → :Node constraint) safely brings the production graph into the state B′ assumes, before any write. Full operator runbook in docs/node-identity-migration.md. Bumps server to v4.0.2. Evidence: - tests/neo4j: 69 passed (includes RED→GREEN silent-drop proof + dirty-graph migration test) - Non-neo4j unit suite: 1373 passed, 2 skipped - ruff: clean on all changed files 🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier) Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>

bkrabach · 2026-06-19T21:05:00Z

Tested #21 against a clone of a real 1.4M-node graph before merge — green light from our side. 👍

We snapshotted a live context-intelligence graph (1,398,163 nodes / 1,937,529 rels) into a throwaway Neo4j and ran the migration there. Production was never touched.

Tests (independent run):

tests/neo4j/ -m neo4j → 69 passed
unit (-m "not neo4j and not integration") → 1326 passed
ruff clean

ensure_neo4j_schema on the real graph:

Scenario	Result
Graph as-is (already `:Node`-tagged from #19)	9.2s · `returned True` · 1 stray duplicate removed · `:Node` constraint created · 0 remaining dup groups
`:Node` stripped from all 1.4M nodes, re-run (exercises the resurrected backfill)	9.8s · backfilled 1,398,162 nodes → 0 untagged · constraint created · 0 dups · 0 data loss

So: ~10s end-to-end at real scale, no OOM, no errors. The dead-code backfill now runs and completes, and the drop-legacy-index → create-:Node-constraint handoff lands cleanly (no window where the graph is left without a backing index).

One non-blocking suggestion: the Step-1 global dedup (MATCH (n) WITH ... collect(n) ... DETACH DELETE) is the only un-batched pass in the migration. It was fine here — our graph had exactly 1 duplicate, so the collect groups are trivial — but it materializes a group-map over all nodes in heap. We tested on a large (ergonomic) JVM heap, so we can't certify it's OOM-proof on a memory-constrained Neo4j. Worth wrapping in CALL { ... } IN TRANSACTIONS like the backfill right below it — cheap insurance against the exact unbounded-transaction shape that motivated the chunking work originally.

Net: safe to merge from a real-data-scale standpoint. The self-healing edge writer + backfill resurrection both check out. Nice work.

colombod · 2026-06-19T22:01:36Z

Thanks for running it against a real 1.4M-node clone, @bkrabach — that's the validation that matters here.

On the Step-1 dedup batching suggestion: I tried it two ways and both made things worse on a memory-constrained Neo4j, so I'm keeping the un-batched form. Evidence:

Wrapping the DETACH DELETE in CALL { … } IN TRANSACTIONS — both the original collect(n) form and a lighter scalar-count(*) driving form — trips the 2 MiB db.memory.transaction.max integration test (tests/neo4j/test_orphan_visibility.py::test_finalization_orphan_surfaces_on_status) with MemoryPoolOutOfMemoryError.
Root cause: CALL { … } IN TRANSACTIONS eagerly materializes its driving rows before the first inner batch, and the dedup needs a grouping aggregation to find duplicate (node_id, workspace) keys — so peak memory goes up on a constrained heap, not down. The :Node backfill can use IN TRANSACTIONS safely only because its driving query is a plain streaming MATCH (n) WHERE NOT n:Node with no aggregation buffer.
The un-batched dedup passes that exact 2 MiB-capped test (and the full tests/neo4j suite, 69 passed) in addition to your 1.4M-node run — i.e. it's the form that actually survives the memory-constrained scenario the suggestion was guarding against.

If we ever need hard bounding for a pathologically duplicated graph, the right shape is a streaming per-key delete that avoids any full-graph grouping — but that wants the :Node index in place first, so it's a post-migration follow-up rather than something to fold into this pass. For the data we have (your clone had a single duplicate group), the un-batched dedup is correct and bounded enough. Leaving it as-is.

colombod merged commit 550c127 into main Jun 19, 2026
3 checks passed

colombod deleted the fix/cross-session-edge-self-heal-node-identity branch June 19, 2026 22:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2)#21

fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2)#21
colombod merged 1 commit into
mainfrom
fix/cross-session-edge-self-heal-node-identity

colombod commented Jun 19, 2026

Uh oh!

bkrabach commented Jun 19, 2026

Uh oh!

colombod commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

colombod commented Jun 19, 2026

Summary

Root cause

Fix

Migration

Evidence

Reviewer checklist

Uh oh!

bkrabach commented Jun 19, 2026

Uh oh!

colombod commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants